Abstract

A challenging problem in the study of citation networks is modeling their high
clustering. Existing studies indicate that a promising way to model high
clustering is a copying strategy, i.e., a paper copies the references of its neighbour
as its own references. However, this line of models greatly underestimates the
number of triangles observed in real citation networks and thus cannot
model the high clustering well. In this paper, we point out that the failure of
existing models lies in the fact that they do not capture the connecting patterns among
existing papers. By leveraging the knowledge indicated by such connecting
patterns, we propose a new model for the high clustering in citation
networks. Experiments on two real-world citation networks, one from a
specialized research area and one from a multidisciplinary research area, demonstrate
that our model reproduces not only the power-law degree distribution, as traditional
models do, but also the number of triangles, the high clustering coefficient and the
size distribution of co-citation clusters observed in these real networks.

Keywords: citation network modeling, high clustering, triangle number, 
growth model 
1. Introduction

As a concise mathematical tool, the network is widely used to describe systems
of interacting components 0, El 01], including social networks, the World Wide
Web and citation networks, to name a few. Among the studies on networks,
much research attention has been paid to citation networks of papers, patents
and legal cases 0, 0, H, 0, 0-. In particular, scientific citation networks are
the subject of a large body of literature, and it is believed that such studies can
help us better understand the collaboration of scientists and the exchange of ideas,
and create better measures of scientific impact. In this paper, we focus on
scientific citation networks.
Figure 1: Illustrations of (a) the forest fire model, (b) the triadic closure model and (c) the
DAC model proposed in this paper. Here the node i is a new node and we assume its out-degree
is 3. In the forest fire model, i first connects to an old node j chosen at random and then
links to some of j's out- and in-neighbours, k and x for example. In the triadic closure
model, i first connects to an old node j through preferential attachment and then, with some
probability, links to one of j's neighbours, node k for example. Then i attaches an arc to one
of j's or k's neighbours, y. In the DAC model, i first connects to an old node j according
to preferential attachment and then connects to j's neighbours by considering the connecting
pattern among them, such as the clique structure. Here, nodes j, x and y form a clique and
thus are preferred as targets of node i.



One outstanding challenge in the study of citation networks is to find the
mechanism that governs their growth. For this purpose,
much work has been done to investigate and model the growth of citation
networks 0, [13, U, 12, 13, [HI, 0]. Among the methods for citation network
modeling, growth models are widely used, on the grounds that papers
are added to a citation network sequentially and all the out-links of a paper are
generated when it joins the network. In a growth model, the key is to determine
the papers which will be cited by the new paper. Existing models address this
problem using certain preferential attachment mechanisms, involving the
in-degree (3,11,1, the age [H, U0|, [TT1, (H 0, [T^ and the content similarity [H E^.
These models perform well at reproducing the power-law degree distribution.
However, they underestimate the number of triangles and thus fail to model
the high clustering in citation networks, which is closely related to network
transitivity and the formation of communities [20].

The common practice to produce triangles is a copying strategy [20, 21],
i.e., a node copies the links of its neighbour as its own, partially or completely.
Two typical models are the forest fire model proposed by Leskovec et al. [22]
Figure 2: (a) The growth of the triangle number $T_i$ as a function of the network size i for the
hep-th data, and for the forest fire model and the triadic closure model applied to the data. The
hep-th data describe the citation relations among the preprints in the high-energy theory archive
posted at www.arxiv.org between 1992 and 2003. For the forest fire model the parameters are the
same as in [22], and for the triadic closure model the parameters are the same as in [23]. (b) The
link density of the reference graph as a function of a node's out-degree in the hep-th data, the
forest fire model and the triadic closure model for the data. $k_{out}$ denotes the out-degree of
the node and D is the average link density of the reference graphs of nodes with the same out-degree.



and the triadic closure model proposed by Wu et al. [23], as shown in Fig. 1.
In the forest fire model, a new paper randomly cites an existing paper and
then cites its references and its citing papers with a certain probability. In the
triadic closure model, a new paper either cites an existing paper according
to a certain preferential attachment mechanism or cites the papers cited by the
new paper's references. Surprisingly, although these two typical models are
designed with the goal of forming abundant triangles, they greatly underestimate
the number of triangles observed in real-world networks, as shown in Fig. 2(a) (see footnote 1).
One possible cause of the underestimation lies in the copying strategy used to form
triangles. Specifically, when a new paper copies the links of its neighbours, it
ignores the existing connections among the targets, i.e., the papers citing
or cited by the new paper's references. As shown in Fig. 1, both the forest fire
model and the triadic closure model are blind to the fact that there exists
a link between the target papers x and y, and thus miss the chance to form
more triangles by citing them.

In this paper, by leveraging the knowledge ignored by the two aforementioned
models, we propose a new model for the high clustering in citation
networks. We further verify the effectiveness of our model using two real-world
citation networks, one from a specialized research area and one from a
multidisciplinary research area. Experimental results demonstrate that our model can
reproduce not only the power-law degree distribution, as traditional models do, but



1 In [23] the number of triangles is claimed to agree with the real data. However, many of the
generated triangles are duplicates, and in this paper the results are calculated after removing
those duplicates.
also the number of triangles, the high clustering coefficient and the size distribution
of co-citation clusters observed in these real networks.

The rest of this paper is organized as follows. In Section 2 we analyze the
structural characteristics of the reference graphs of papers in a real citation
network. Here, the reference graph of a paper characterizes the citation relations among
the references of this paper. Based on the results of this analysis, in Section 3 we
propose our DAC model for modeling the high clustering in citation networks.
Section 4 describes the experimental results obtained by applying our model to
two real networks. Finally, Section 5 concludes this paper and gives some
discussion.



2. The reference graphs in the real data 

Before giving a model for citation networks, we first analyze a real-world
citation network, the hep-th network, to provide some intuition for
designing an appropriate model. Our analysis is conducted on the reference
graph of each paper. The reference graph of a paper characterizes the citation
relations among the references of the paper. For a given paper, its reference
graph can be viewed as its "ego-graph" or "ego-network", excluding the paper itself
and the papers citing it. The structure of a reference graph gives us a
complete picture of how papers are connected to each other before they are
cited by the given paper. Therefore, the analysis of such graphs is critical for finding
clues to the microscopic mechanisms governing the evolution of citation networks.

As an example, Fig. 3(a) shows the reference graph of a paper in the hep-th
data. We can see that the nodes in the graph are connected into a single component.
This indicates that when authors cite one paper they also tend to cite the paper's
neighbours, i.e., papers in the paper's references or papers citing the paper.
This phenomenon reflects the reading behaviour of researchers, i.e., when they
are interested in a paper they are very likely to be interested in the papers in
its references and the papers citing it. From Fig. 3(a), we can also see that the
reference graph has a very high link density. Fig. 2(b) shows the link density
of reference graphs with respect to the out-degrees of papers. It is clear that
the link density of a paper's reference graph is negatively correlated with its out-degree.
This phenomenon may be attributed to the fact that papers with high out-degrees
are usually reviews or surveys, and thus their reference graphs have lower
link density, while papers with low out-degrees are papers on a specific topic.
Furthermore, this phenomenon cannot be modeled well by the
two typical existing models for high clustering in citation networks. In particular,
the link density for papers with high out-degrees is largely underestimated.
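
To make the notion of a reference graph concrete, the following Python sketch extracts the
reference graph of a paper and computes its link density. It is only an illustration under
stated assumptions: the citation data are assumed to be a dictionary `cites` mapping each paper
to the set of papers it cites, the function names are hypothetical, and the link density is
assumed to be the number of arcs divided by the number of node pairs n(n-1)/2, a normalization
that is not spelled out in the text.

```python
def reference_graph(cites, paper):
    """Return (nodes, arcs) of the reference graph of `paper`.

    cites: dict mapping a paper to the set of papers it cites (assumed input format).
    The nodes are the references of `paper`; the arcs are the citations among them.
    """
    refs = set(cites.get(paper, ()))
    arcs = {(u, v) for u in refs for v in cites.get(u, ()) if v in refs and v != u}
    return refs, arcs


def link_density(cites, paper):
    """Link density of the reference graph of `paper`.

    Assumption: density = (#arcs) / (n * (n - 1) / 2), i.e. at most one arc per
    pair of references, which holds when the citation network is acyclic.
    """
    refs, arcs = reference_graph(cites, paper)
    n = len(refs)
    if n < 2:
        return 0.0  # density is undefined for fewer than two references; 0 by convention
    return len(arcs) / (n * (n - 1) / 2)


# Toy data: paper 5 cites 1, 2, 3 and 4; among these, 4 cites 3, 3 cites 1, 2 cites 1.
example = {5: {1, 2, 3, 4}, 4: {3}, 3: {1}, 2: {1}, 1: set()}
nodes, arcs = reference_graph(example, 5)
density = link_density(example, 5)
```

In this toy example the reference graph of paper 5 has four nodes and three arcs, so three of
the six possible pairs are linked.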

We further find that the reference graph contains many large cliques.
A clique is a subgraph within which every two nodes are connected.
Abundant cliques are crucial to high clustering [24] and community structure
[25, 26, 27]. As shown in Fig. 3(a), the reference graph contains two cliques
with size 7 and the average clique size of the graph is about 4.56. A large clique
may contain many smaller cliques. In this paper, we use maximal cliques to
avoid repeated counting. Fig. 3(b) illustrates the distribution of the size
Figure 3: (a) The reference graph of node 19000 in the hep-th data. This node corresponds
to the paper "Brane world from IIB matrices" with DOI 10.1103/PhysRevLett.85.4664. The
graph contains 19 nodes and 58 arcs. (b) The cumulative distributions P of the average clique
size and the maximum clique size in the reference graphs. For the average clique size of one
reference graph, we find all the maximal cliques in the reference graph, sum up their sizes
and take the average. The maximum clique size refers to the size of the
largest maximal clique in the reference graph. We can see that about 58% of the reference graphs
have a maximum clique of size no less than 5, and about 25% of the graphs have an average
clique size larger than 5. Moreover, 11% of the graphs have a maximum clique of size no less
than 10. Old nodes have low out-degrees, so to make the observations clear the results here are
obtained using the reference graphs of the newest 20% of nodes.



of maximal cliques. The formation of these cliques is rooted in the fact that authors tend to
cite a group of closely related papers. Take the literature on
citation networks as an example: in 2005, a paper k Q revealed long-term
systematic features of citation statistics based on observations over a period of
real data. Later on, a paper j provided a model for the aging characteristics
in citation networks and cited k as a reference. Recently, Wu et al.'s paper i
[23] integrated the aging and triadic closure mechanisms to model the citation
patterns and cited both j and k, which creates a 3-clique ijk. As research on
this problem goes on, new papers (such as this paper) will cite these earlier papers
and larger cliques will emerge. Thus, a highly connected structure, such as a clique,
indicates topical correlations among the nodes in it. When a paper cites one
node in a clique, it will with high probability also cite the others in the clique.
Besides, a paper prefers to cite papers with large in-degree (popular) and small
age (undergoing recognition). Therefore, in a growth model the in-degree and the age
are usually taken into account in the preferential attachment.
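
The clique statistics used in this section (the average and the maximum size of the maximal
cliques of a reference graph) can be sketched in Python as follows. This is an illustrative
sketch rather than the procedure actually used for the figures: it assumes that cliques are
taken in the undirected version of the reference graph, it uses the standard Bron-Kerbosch
algorithm with pivoting, the function names are hypothetical, and an isolated node is counted
as a maximal clique of size 1, which may differ from the convention used for the data.

```python
def undirected_adjacency(nodes, arcs):
    """Build an undirected adjacency dict from a set of nodes and directed arcs."""
    adj = {u: set() for u in nodes}
    for u, v in arcs:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def maximal_cliques(adj):
    """Enumerate all maximal cliques with Bron-Kerbosch (pivoting variant)."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates that extend r, x: already processed nodes
        if not p and not x:
            cliques.append(r)
            return
        pivot = max(p | x, key=lambda u: len(adj[u] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    if adj:
        expand(set(), set(adj), set())
    return cliques


def clique_size_stats(adj):
    """Return (average maximal clique size, maximum clique size)."""
    sizes = [len(c) for c in maximal_cliques(adj)]
    return sum(sizes) / len(sizes), max(sizes)


# Toy reference graph with arcs 2->1, 3->1, 3->2, 4->3 and 5->4.
toy_adj = undirected_adjacency({1, 2, 3, 4, 5},
                               {(2, 1), (3, 1), (3, 2), (4, 3), (5, 4)})
toy_cliques = sorted(sorted(c) for c in maximal_cliques(toy_adj))
toy_average, toy_maximum = clique_size_stats(toy_adj)
```

The enumeration has exponential worst-case cost, which is acceptable here only because
reference graphs are small (the one in Fig. 3(a) has 19 nodes).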



3. The DAC model 

On the basis of the above observations, we propose our model for citation
networks: the Degree-Aging preferential attachment and Clique neighbourhood
attachment model, or DAC model for short. It is a growth model in which nodes
join the network sequentially and attach their arcs to the old ones. In citation
networks, nodes are ordered temporally, i.e., they join the network according
to their ages. In our DAC model we keep the order and the out-degrees of the nodes
the same as in the original data. It is harmless to treat the out-degree as given
information, because the out-degree of each paper is decided by its authors and
we are mostly concerned with the in-degree. As its name suggests, the
DAC model is composed of two parts:

• the degree-aging preferential attachment. A new node i first creates
an arc to an old node j with probability $\Pi_{ij} \propto k_j^{in} \times t_j^{-\alpha}$, where
$k_j^{in}$ is j's in-degree, $t_j = i - j$ is the age of j and $\alpha > 0$ is the decay
parameter. This power-law form of the probability function is widely
adopted to model degree-aging preferential attachment in the literature,
for example in the Dorogovtsev-Mendes (DM) model [l?} and the model
in [HI].

• the clique neighbourhood attachment. With probability $\beta$ ($0 < \beta < 1$),
node i chooses to link to j's clique neighbours, i.e., the nodes in the same
clique that j belongs to, as illustrated in Fig. 1(c). Node j may belong to
many cliques; i randomly chooses one with probability proportional to the clique's size
and links to all the nodes in that clique. Otherwise, i.e., with probability $1 - \beta$
or when there are no clique neighbours that i can connect to, i attaches an arc
using the degree-aging preferential attachment described above. Here j is one of
i's neighbours.

We repeat the above attachment mechanisms until the remaining out-degree
of i is filled. The clique neighbourhood attachment takes the connecting
patterns of the potential neighbours into account and guides the formation of
triangles. By tuning the parameter $\beta$ we can control the growth rate of clustering,
i.e., a larger $\beta$ produces higher clustering.
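
The two attachment mechanisms described above can be sketched as a small Python simulation.
This is a minimal sketch under several assumptions that are not part of the model description:
the preferential attachment weight uses an offset of +1 on the in-degree so that uncited nodes
can still be chosen; cliques are maximal cliques of the undirected version of the current
network, computed without the new node; the neighbour j used for the clique step is drawn
uniformly from the nodes already cited by i; a chosen clique is truncated when it exceeds the
remaining out-degree; and an out-degree larger than the number of existing nodes is capped.
All function and parameter names are hypothetical.

```python
import random


def cliques_containing(adj, j, skip):
    """Maximal cliques containing j in the undirected graph `adj`, ignoring node `skip`."""
    found = []

    def expand(r, p, x):
        # Plain Bron-Kerbosch recursion started from the clique {j}.
        if not p and not x:
            found.append(r)
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand({j}, adj[j] - {skip}, set())
    return found


def pa_target(i, k_in, alpha, exclude, rng):
    """Degree-aging preferential attachment: weight (k_in + 1) * age ** (-alpha).

    The +1 offset is an assumption of this sketch, not part of the stated formula.
    """
    candidates = [j for j in range(i) if j not in exclude]
    weights = [(k_in[j] + 1) * (i - j) ** (-alpha) for j in candidates]
    return rng.choices(candidates, weights=weights)[0]


def grow_dac(out_degrees, alpha, beta, seed=0):
    """Grow a network node by node; returns cites[i] = set of nodes cited by i."""
    rng = random.Random(seed)
    n = len(out_degrees)
    cites = [set() for _ in range(n)]
    adj = [set() for _ in range(n)]  # undirected view used for clique detection
    k_in = [0] * n

    def link(i, j):
        cites[i].add(j)
        adj[i].add(j)
        adj[j].add(i)
        k_in[j] += 1

    for i in range(1, n):
        budget = min(out_degrees[i], i)  # cannot cite more papers than exist
        while len(cites[i]) < budget:
            linked = False
            if cites[i] and rng.random() < beta:
                j = rng.choice(sorted(cites[i]))
                options = []
                for clique in cliques_containing(adj, j, skip=i):
                    new = sorted(clique - cites[i])
                    if new:
                        options.append((len(clique), new))
                if options:
                    sizes = [size for size, _ in options]
                    _, new = rng.choices(options, weights=sizes)[0]
                    for v in new[: budget - len(cites[i])]:
                        link(i, v)
                    linked = True
            if not linked:  # probability 1 - beta, or no clique neighbour available
                link(i, pa_target(i, k_in, alpha, cites[i], rng))
    return cites


degrees = [0, 1, 2, 3, 3, 2, 3, 3, 4, 2]
network = grow_dac(degrees, alpha=1.0, beta=0.5, seed=1)
```

Recomputing cliques at every step is costly, so a simulation at the scale of the real data
would need to maintain clique information incrementally; the sketch only aims to make the
order of the steps explicit.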

4. The data and modeling results 

In this section, we examine the DAC model on the following two real-world 
citation networks. 

• hep-th data, which come from preprints in the high-energy theory archive
posted at www.arxiv.org between 1992 and 2003. The data set contains 27,770
preprints after cleaning.

• PNAS data, which contain 23,572 articles published by the Proceedings
of the National Academy of Sciences (PNAS) of the United States of
America from 1998 to 2007. We crawled the data from the journal's website
(http://www.pnas.org) in May 2008 (see footnote 2).



2 We removed the isolated nodes from the two data sets, since we aim to model the citation
patterns of citation networks and these nodes play no role in this study.
We choose these two networks because they provide data of different types,
i.e., one covers a specialized research area and the other covers multidisciplinary
sciences. The basic structural statistics of the two data sets are listed in Table 1. It
shows that the two networks are comparable in size, while the hep-th
network is much denser than the PNAS network. Since a large fraction of the articles on
high-energy theory are posted at www.arxiv.org, the internal citations in the hep-th data are
very dense. In the PNAS data, by contrast, papers broadly span the physical, biological and
social sciences, and therefore the internal citations are much sparser.

As we intend to model the clustering features of citation networks, three
quantities are observed here: the number of triangles, the clustering coefficient
and the link density of the reference graph. The triangle number of the network is
the basic statistic of clustering structures, and its growth as a function of network
size provides insight into how the clustering evolves. The average clustering
coefficient of the network gives an overall indication of the clustering in the
whole network. We also analyze the average clustering coefficient of vertices
with the same degree as a function of the degree, because this correlation
helps in understanding the local structure of the network. The
link density of the reference graph is used to check how well our model matches the real
data in selecting neighbours. Besides these statistics, the basic
structural quantity, i.e., the in-degree distribution, is also measured here.
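
As an illustration of how the triangle number and the clustering coefficient can be measured,
the following Python sketch computes both from a list of arcs. It assumes that arc directions
are ignored, that the clustering coefficient is the standard local clustering coefficient
averaged over all nodes, and that nodes with fewer than two neighbours contribute 0; these
conventions and the function name are assumptions of the sketch.

```python
from itertools import combinations


def triangles_and_clustering(arcs):
    """Return (number of triangles, average clustering coefficient, local coefficients).

    arcs: iterable of (citing, cited) pairs; directions are ignored (assumption).
    """
    adj = {}
    for u, v in arcs:
        if u == v:
            continue  # ignore self-citations
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # For each node, count the linked pairs among its neighbours.
    closed = {u: sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
              for u, nbrs in adj.items()}
    n_triangles = sum(closed.values()) // 3  # each triangle is seen from its 3 corners

    local = {}
    for u, nbrs in adj.items():
        k = len(nbrs)
        local[u] = 2 * closed[u] / (k * (k - 1)) if k > 1 else 0.0
    average = sum(local.values()) / len(local) if local else 0.0
    return n_triangles, average, local


# Toy network: 2 cites 1, 3 cites 1 and 2, 4 cites 3.
count, average, local = triangles_and_clustering([(2, 1), (3, 1), (3, 2), (4, 3)])
```

Grouping the entries of `local` by node degree and averaging within each group gives the
degree-dependent clustering coefficient discussed above.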

Table 1: Basic statistics of the hep-th data and the PNAS data. $N$, $L$, $\Delta$ and $C$ denote
the number of nodes, arcs and triangles and the average clustering coefficient [1] in the empirical
networks. $\Delta_{ER}$ denotes the triangle number in the networks generated by the E-R random
graph model. $\Delta_{DAC}$ and $C_{DAC}$ denote the triangle number and the average clustering
coefficient in the networks generated by the DAC model. The results of the E-R model and the DAC
model are averaged over 100 independent realizations.



Measures / Networks    hep-th               PNAS
$N$                    27,770               23,572
$L$                    352,768              40,853
$\Delta$               1,478,735            13,225
$\Delta_{DAC}$         1,484,004 ± 3,813    13,336 ± 172
$\Delta_{ER}$          2,742 ± 51           7 ± 2
$C$                    0.312                0.171
$C_{DAC}$              0.354 ± 0.005        0.186 ± 0.002



The numerical results are shown in Table 1 and Fig. 4. We find that although
the two data sets are very different in nature, they share many structural
characteristics: the in-degree distributions both follow a power law, the triangle
numbers are both much larger than in random networks, and the number of
triangles follows a similar growth law as a function of the network size in both. As for the
performance of our DAC model, Table 1 shows that the number of triangles and
the average clustering coefficient are both matched for the two data sets, which
confirms that our model can reproduce the clustering features of citation networks.
Figure 4: The in-degree distribution, the growth of the triangle number $T_i$ as a function of
network size i, the average clustering coefficient C as a function of a node's degree k and the
link density of the reference graph D as a function of a node's out-degree $k_{out}$ for the two
empirical networks and the DAC model. Plots (a), (b), (c) and (d) are for the hep-th data and (e),
(f), (g) and (h) are for the PNAS data. The model parameters are scanned over reasonable ranges
and chosen by the best fit to the empirical data, i.e., $\alpha = 1$ and $\beta = 0.48$ for the
hep-th data and $\alpha = 1$ and $\beta = 0.44$ for the PNAS data. The results are averaged over
100 independent realizations.




Figure 5: The cumulative size distribution of co-citation clusters for the two empirical networks
and the DAC model. The clusters are obtained by the clique percolation method [25] with k = 4.
In both figures, P denotes the cumulative distribution and S denotes the size of clusters.



Detailed comparisons are shown in Fig. 4. For the hep-th data, as Fig. 4(a)
shows, the in-degree distribution is well fitted. In Fig. 4(b), we can see that not
only the final number but also the growth of the triangle number matches remarkably
well between our model and the empirical data. Fig. 4(c) reveals that the
average clustering coefficient decays with the node's degree in the data and that this
feature is captured by our DAC model. The fourth quantity is the link density
of the reference graph, shown in Fig. 4(d). The relationship between link
density and out-degree is well reproduced by the DAC model. For the PNAS
data, the four statistics observed here are all well reproduced by the model too.

Besides the microscopic clustering statistics such as the number of triangles and the
clustering coefficient, we also investigate the size distribution of co-citation
clusters [28] to verify the effectiveness of our model. For a given citation network,
we first construct a co-citation network, in which nodes are papers and two
nodes are linked if their corresponding papers are cited by the same paper.
The co-citation network is undirected and weighted, with the weight on the edge between
papers i and j measured in terms of the cosine coefficient between the two sets of papers that
cite i and j respectively [28]. Then we use the clique percolation method (CPM)
[25] to identify co-citation clusters in the co-citation network. As CPM requires
the network to be unweighted, we remove all edges with weights smaller than a
threshold w*, where w* is determined using the method provided in pBj. Fig. 5
shows the size distributions of co-citation clusters for the hep-th network and the PNAS
network. We can see that the DAC model generates size distributions comparable to those
of the real data. Moreover, the size distributions of the two networks both
have broad ranges, which is in agreement with the results in [25].
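
The construction of the thresholded co-citation network can be sketched in Python as follows.
The sketch assumes that the cosine coefficient between two sets A and B of citing papers is
|A ∩ B| / sqrt(|A| |B|), which is the usual set form of the cosine measure; the input format,
the function name and the value of the threshold are hypothetical, and the procedure for
choosing w* is not reproduced here.

```python
from itertools import combinations
from math import sqrt


def cocitation_edges(cites, threshold):
    """Return {(a, b): weight} for co-cited pairs with weight >= threshold.

    cites: dict mapping a paper to the set of papers it cites (assumed input format).
    Weight (assumption): |A & B| / sqrt(|A| * |B|), where A and B are the sets of
    papers citing a and b respectively.
    """
    cited_by = {}
    for paper, refs in cites.items():
        for r in refs:
            cited_by.setdefault(r, set()).add(paper)

    weights = {}
    for refs in cites.values():
        # Every pair of references of the same paper is a co-cited pair.
        for a, b in combinations(sorted(refs), 2):
            if (a, b) not in weights:
                common = len(cited_by[a] & cited_by[b])
                weights[(a, b)] = common / sqrt(len(cited_by[a]) * len(cited_by[b]))

    # Edges below the threshold are removed, leaving an unweighted edge set for CPM.
    return {edge: w for edge, w in weights.items() if w >= threshold}


# Toy data: papers 10 and 11 both cite 1 and 2; papers 11 and 12 both cite 3.
toy_cites = {10: {1, 2}, 11: {1, 2, 3}, 12: {3}}
strong_edges = cocitation_edges(toy_cites, threshold=0.6)
all_edges = cocitation_edges(toy_cites, threshold=0.0)
```

The remaining edges can then be passed to any implementation of the clique percolation
method to obtain the co-citation clusters.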



5. Conclusion and Discussion 

In this paper, we focused on modeling the clustering features of citation
networks. We observed that reference graphs are highly connected and
contain many cliques, which helps the formation of clustering in the network.
Based on these observations, we proposed a growth model, the DAC model, for
citation networks. The model adds nodes one by one and fills up the nodes'
out-degrees using two attachment mechanisms: the degree-aging
preferential attachment and the clique neighbourhood attachment. We
validated the model by comparing five quantities, namely the in-degree distribution, the
growth of the triangle number, the average clustering coefficient, the link density
of reference graphs and the size distribution of co-citation clusters, on two
real-world citation networks. Good agreement is obtained for both data sets by tuning
the parameters of the attachment mechanisms.

The results on the two real-world data suggest that the attachment mech- 
anisms in the model capture the linking rules of scientific citation networks: a 
paper prefers to cite recent and popular ones and this helps to form the degree 
distribution of the network. Moreover, a paper tends to cite its neighbours' 
clique neighbours and this helps to form the clustering. This work is a step for- 
ward in the modeling of citation networks and will provide insights for further 
studies such as the formation of subgraphs. 

In this paper we provide one way to incorporate the topological information
of the potential neighbours, and better methods are worth exploring.
Nodes in citation networks are documents, so textual or semantic
information may be helpful in the preferential attachment mechanisms, and the
previous works [18, 19] give some indications. Moreover, high clustering is
a common characteristic of many real-world networks, and we will further test
our mechanisms in modeling the evolution of other kinds of networks, such as
social networks.